Data mining techniques are applied in many applications as a standard
procedure for analyzing large volumes of available data and extracting useful
information and knowledge to support major decision-making processes.
Diabetes mellitus is a chronic, widespread and potentially fatal disease occurring
all around the world. It is characterized by hyperglycemia caused by
abnormalities in insulin secretion, which in turn result in an irregular rise
of the glucose level. In recent years, the impact of diabetes mellitus has
increased greatly, especially in developing countries like India, mainly
because of irregularities in food habits and lifestyle. Thus, early
diagnosis and classification of this disease has become an active area
of research in the last decade. Numerous clustering and classification
techniques are available in the literature to visualize temporal data and identify
trends for controlling diabetes mellitus. This work presents an experimental
study of several algorithms that classify diabetes mellitus data
effectively. The existing algorithms are analyzed thoroughly to identify their
advantages and limitations, and their performance is assessed
to determine the best approach.
1. INTRODUCTION
Diabetes mellitus is a group of metabolic diseases in which a person experiences high blood glucose
levels, either because the body produces inadequate insulin or because the body cells do not respond properly to the
insulin produced by the body. Patients with diabetes often experience frequent urination (polyuria), increased
thirst (polydipsia) and increased hunger (polyphagia) [1], [2]. There are three types of diabetes:


a. Type 1 Diabetes

In this type of diabetes, the body does not produce enough insulin. This type of diabetes is also referred to
as insulin-dependent diabetes, juvenile diabetes or early-onset diabetes. Type 1 diabetes usually develops
before a person is 40 years old, i.e., in the teenage years or early adulthood. Patients with type 1 diabetes need to
take insulin injections for the rest of their life. They must also ensure proper blood-glucose levels by
carrying out regular blood tests and following a special diet.

b. Type 2 Diabetes

In Type 2 diabetes, the body does not produce enough insulin or the cells in the body display insulin
resistance. Some people may be able to control their type 2 diabetes symptoms by losing weight, following
a healthy diet, doing plenty of exercise, and monitoring their blood glucose levels. However, type 2
diabetes is typically a progressive disease that gradually gets worse, and the patient will probably end up
having to take medication, usually in tablet form, and possibly insulin. Being overweight, being physically
inactive and eating the wrong foods all contribute to the risk of developing type 2 diabetes. The risk of
developing Type 2 diabetes also increases with age [3], [4].
c. Gestational Diabetes 

This type affects females during pregnancy. Some women have very high levels of glucose in their blood,
and their bodies are unable to produce enough insulin to transport all of the glucose into their cells,
resulting in progressively rising levels of glucose. The majority of gestational diabetes patients can control
their diabetes with exercise and diet. Between 10% and 20% of them will need to take some kind of
blood-glucose-controlling medication. Undiagnosed or uncontrolled gestational diabetes can raise the risk of
complications during childbirth.


2. PROCESS WORK FLOW 
Figure 1 shows the process of conceptual framework. 


[Flowchart stages: Define the Problem, Model Construction by Selecting Algorithm, Evaluation]

Figure 1. The process of conceptual framework


3. MODEL CONSTRUCTION 

Model Construction will take place using Logistic Regression, K Nearest Neighbors (KNN), SVM, 
Gradient Boost, Decision tree, MLP, Random Forest and Gaussian Naive Bayes and their performance will 
be evaluated [5], [6]. 


3.1. Logistic regression 

Logistic regression is a linear model for classification rather than regression. It is also
known as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this
model, logistic regression is used to model probabilistically described outcomes of a single trial. It is a basic
model which describes dichotomous output variables and can be extended for disease classification
prediction [7], [8]. Suppose there are N input variables whose values are indicated by $m_1, m_2, \ldots, m_N$.
Let P be the probability that an event will occur and 1 - P the probability that the event will not
occur. The logistic regression model is given by

$$\log\left(\frac{P}{1-P}\right) = \operatorname{logit}(P) = \beta_0 + \beta_1 m_1 + \cdots + \beta_N m_N \tag{1}$$
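
To make the relationship between the linear predictor and the probability concrete, the following is a minimal sketch rather than the implementation used in the study. The coefficient and input values are hypothetical and chosen only for illustration.

```python
import math

def logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))

def predict_probability(betas, features):
    """Invert the logit: betas[0] is the intercept, betas[1:] pair with the features."""
    z = betas[0] + sum(b * m for b, m in zip(betas[1:], features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients and input values (not taken from the paper).
betas = [-1.0, 0.5, 0.25]
features = [2.0, 4.0]

p = predict_probability(betas, features)
print(round(p, 4))         # probability that the event occurs
print(round(logit(p), 4))  # recovers the linear predictor
```

Applying `logit` to the predicted probability returns the value of the linear predictor, which is the sense in which the model is linear in the log-odds.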


3.2. KNN 
K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases
based on a similarity measure (e.g., distance functions). A case is classified by a majority vote of its neighbors,
with the case being assigned to the class most common amongst its K nearest neighbors measured by a
distance function [9].


$$d(x, x') = \sqrt{(x_1 - x'_1)^2 + (x_2 - x'_2)^2 + \cdots + (x_n - x'_n)^2} \tag{2}$$


If K = 1, then the case is simply assigned to the class of its nearest neighbor. Similarity is defined 
according to a distance metric between two data points. A popular choice is the Euclidean distance. More 
formally, given a positive integer K, an unseen observation x and a similarity metric d, KNN classifier 
performs the following two steps: 

a. It runs through the whole dataset computing d between x and each training observation. We’ll call the K 
points in the training data that are closest to x the set A. Note that K is usually odd to prevent tie 
situations. 

b. It then estimates the conditional probability for each class, that is, the fraction of points in A with that 
given class label. (Note I(x) is the indicator function which evaluates to 1 when the argument x is true and 
0 otherwise) 


$$P(Y = j \mid X = x) = \frac{1}{K} \sum_{i \in A} I(y^{(i)} = j) \tag{3}$$

Finally, our input x gets assigned to the class with the largest probability. 
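
The two steps can be sketched in a few lines of plain Python. This is an illustrative sketch, not the code used for the experiments; the training points and labels are hypothetical.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train_x, train_y, x, k):
    """Return the predicted class and the class fractions among the k nearest points."""
    # Step a: rank all training observations by distance to x and keep the k closest.
    nearest = sorted(zip(train_x, train_y), key=lambda pair: euclidean(pair[0], x))[:k]
    # Step b: fraction of the k neighbors carrying each class label.
    counts = Counter(label for _, label in nearest)
    probabilities = {label: count / k for label, count in counts.items()}
    # Assign the class with the largest estimated probability.
    return max(probabilities, key=probabilities.get), probabilities

# Hypothetical two-feature training data with class labels 0 and 1.
train_x = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9)]
train_y = [0, 0, 0, 1, 1]

label, probabilities = knn_predict(train_x, train_y, (3.0, 3.0), k=3)
print(label, probabilities)
```

Sorting the whole training set keeps the sketch short; for large datasets a spatial index would normally replace the full scan.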


3.3. SVM 

A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating
hyperplane [10]. In the linear classifier model, the training examples are assumed to be plotted as points in
space. These data points are expected to be separated by an apparent gap. The classifier predicts a straight
hyperplane dividing the two classes. The primary focus while drawing the hyperplane is on maximizing the
distance from the hyperplane to the nearest data point of either class. The resulting hyperplane is called the
maximum-margin hyperplane [11]. Figure 2 shows the SVM hyperplanes.


Figure 2. SVM hyper planes 


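$$w \cdot x_i - b \ge 1 \ \text{if}\ \delta_i = 1, \qquad w \cdot x_i - b \le -1 \ \text{if}\ \delta_i = -1 \tag{4}$$


where $w$ is the normal vector to the hyperplane, $\delta_i$ denotes the classes and $x_i$ denotes the features. The
distance between the two hyperplanes is $\frac{2}{\|w\|}$; to maximize this distance, the denominator $\|w\|$
should be minimized. For proper classification, we can build a combined equation:


$$\min \|w\| \quad \text{for} \quad \delta_i (w \cdot x_i - b) \ge 1, \quad \forall i = 1, 2, \ldots, n \tag{5}$$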
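
As a small worked illustration of the margin constraint, the sketch below checks whether a candidate hyperplane satisfies the combined condition for every training point and computes the resulting margin width. The points, the class labels (+1 and -1) and the hyperplane parameters are hypothetical; no optimization is performed here.

```python
import math

def decision_value(w, b, x):
    """Signed value w . x - b for one point."""
    return sum(wi * xi for wi, xi in zip(w, x)) - b

def satisfies_constraints(w, b, points, labels):
    """True if label * (w . x - b) >= 1 holds for every training point."""
    return all(label * decision_value(w, b, x) >= 1 for x, label in zip(points, labels))

def margin_width(w):
    """Distance between the two margin hyperplanes, 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical, linearly separable toy data with labels +1 and -1.
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1, -1]

w, b = (0.5, 0.5), 1.0
print(satisfies_constraints(w, b, points, labels))  # constraints hold for this hyperplane
print(round(margin_width(w), 4))                    # width of the gap between the classes
```

Shrinking `w` widens the margin, but only as long as every point still satisfies the constraint, which is why the norm is minimized subject to that condition.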
3.4. Gradient boost 
Boosting refers to a family of algorithms that are able to convert weak learners to strong learners.
The main principle of boosting is to fit a sequence of weak learners (models that are only slightly better than
random guessing, such as small decision trees) to weighted versions of the data. More weight is given to
examples that were misclassified by earlier rounds. The predictions are then combined through a weighted
majority vote (classification) or a weighted sum (regression) to produce the final prediction. Gradient Tree
Boosting is a generalization of boosting to arbitrary differentiable loss functions. It can be used for both
regression and classification problems. Gradient Boosting builds the model in a sequential way:


$$F_m(x) = F_{m-1}(x) + \gamma_m h_m(x) \tag{6}$$


At each stage the decision tree $h_m(x)$ is chosen to minimize a loss function $L$ given the current model $F_{m-1}(x)$:


$$F_m(x) = F_{m-1}(x) + \arg\min_{h} \sum_{i=1}^{n} L\big(y_i, F_{m-1}(x_i) + h(x_i)\big) \tag{7}$$
The algorithms for regression and classification differ in the type of loss function used. 
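
The stagewise update can be sketched with one-split regression trees (stumps) on a single feature. This is only a sketch under stated assumptions: squared-error loss is assumed, so the best stump at each stage is the least-squares fit to the current residuals, and a fixed step size `gamma` stands in for the per-stage multiplier. The data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump on 1-D inputs (needs at least two distinct x values)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # candidate splits of the form x <= threshold
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, rounds=20, gamma=0.5):
    """Build the additive model F_m = F_{m-1} + gamma * h_m for squared-error loss."""
    base = sum(ys) / len(ys)  # F_0: constant prediction
    stumps = []

    def model(x):
        return base + gamma * sum(h(x) for h in stumps)

    for _ in range(rounds):
        # For squared error, the stump minimizing the loss is the one fitted to the residuals.
        residuals = [y - model(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return model

# Hypothetical 1-D data: the target switches from 0 to 1 between x = 3 and x = 4.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
model = gradient_boost(xs, ys)
print(round(model(1), 3), round(model(6), 3))
```

Each round halves the remaining residual on this toy data, so the predictions approach the targets as more stumps are added.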


3.5. Decision tree 

A decision tree is a simple, deterministic data structure for modelling decision rules for a specific
classification problem. At each node, one feature is selected to make the separating decision. Splitting can
stop once a leaf node contains sufficiently few data points. Such a leaf node then gives insight into the final
result (probabilities for the different classes in the case of classification). The most decisive factor for the
efficiency of a decision tree is the efficiency of its splitting process, as shown in Figure 3. Each node is split
in such a way that the resulting purity is maximal. Purity refers to how well the split segregates the classes
and so increases our knowledge [12].
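
One common way to quantify purity is the Gini impurity, where 0 means a node contains a single class. The sketch below assumes Gini impurity (the text does not name a specific measure) and searches for the threshold on one feature that gives the purest pair of child nodes. The feature values and outcomes are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels; 0.0 means perfectly pure."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Threshold (value <= threshold goes left) with the lowest weighted child impurity."""
    n = len(labels)
    best_threshold, best_impurity = None, float("inf")
    for threshold in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best_impurity:
            best_threshold, best_impurity = threshold, weighted
    return best_threshold, best_impurity

# Hypothetical glucose readings and diabetes outcomes (0 = negative, 1 = positive).
glucose = [85, 90, 110, 140, 150, 165]
outcome = [0, 0, 0, 1, 1, 1]

print(gini(outcome))                 # impurity before splitting
print(best_split(glucose, outcome))  # threshold giving the purest children
```

A full tree would apply `best_split` recursively over all features until the stopping condition on leaf size is met.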


Figure 3. Decision tree 


3.6. MLP 

The Multilayer Perceptron (MLP) is perhaps the most popular network architecture in use today, both
for classification and regression. MLPs are feed-forward neural networks which are typically composed of
several layers of nodes with unidirectional connections, often trained by backpropagation [13], [14]. The
learning process of an MLP network is based on data samples composed of the N-dimensional input vector x
and the M-dimensional desired output vector d, called the destination. By processing the input vector x, the MLP
produces the output signal vector y(x, w), where w is the vector of adapted weights. The error signal produced
actuates a control mechanism of the learning algorithm. The corrective adjustments are designed to bring the
output signal $y_k$ ($k = 1, 2, \ldots, M$) closer to the desired response $d_k$ in a step-by-step manner. If a multilayer
perceptron has a linear activation function in all neurons, that is, a linear function that maps the weighted
inputs to the output of each neuron, then linear algebra shows that any number of layers can be reduced to a
two-layer input-output model. In MLPs some neurons use a nonlinear activation function that was developed to
model the frequency of action potentials, or firing, of biological neurons [15]. The two common activation
functions are both sigmoids, and are described by


$$y(v_i) = \tanh(v_i) \quad \text{and} \quad y(v_i) = (1 + e^{-v_i})^{-1} \tag{8}$$


The first is a hyperbolic tangent that ranges from -1 to 1, while the other is the logistic function,
which is similar in shape but ranges from 0 to 1. Here $y_i$ is the output of the ith node (neuron) and $v_i$ is the
weighted sum of the input connections. The learning algorithm of the MLP is based on the minimization of the
error function defined on the learning set $(x_i, d_i)$ for $i = 1, 2, \ldots, N$ using the Euclidean norm of the
difference between the network output $y(x_i, w)$ and the desired output $d_i$.
The minimization of this error leads to the optimal values of the weights. The most effective methods of
minimization are the gradient algorithms, of which the most effective are the Levenberg-Marquardt
algorithm for medium size networks and conjugate gradient for large size networks. Figure 4 shows the MLP
structure.
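
A forward pass through a small network illustrates how the two activation functions and the error signal fit together. This is a minimal sketch, not the network used in the study: the layer sizes, weights and target are hypothetical, and the squared Euclidean distance is assumed as the per-sample error.

```python
import math

def logistic(v):
    """Logistic sigmoid, ranging from 0 to 1."""
    return 1.0 / (1.0 + math.exp(-v))

def layer(inputs, weights, biases, activation):
    """One fully connected layer: activation(weighted sum of inputs + bias) per neuron."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + bias)
            for row, bias in zip(weights, biases)]

def mlp_forward(x, hidden_w, hidden_b, out_w, out_b):
    """Feed-forward pass: tanh hidden layer followed by a logistic output layer."""
    hidden = layer(x, hidden_w, hidden_b, math.tanh)
    return layer(hidden, out_w, out_b, logistic)

def squared_error(y, d):
    """Squared Euclidean distance between network output y and desired output d."""
    return sum((yk - dk) ** 2 for yk, dk in zip(y, d))

# Hypothetical weights for a 2-input, 2-hidden-neuron, 1-output network.
hidden_w = [[1.0, -1.0], [0.5, 0.5]]
hidden_b = [0.0, 0.0]
out_w = [[1.0, 1.0]]
out_b = [0.0]

y = mlp_forward([0.0, 0.0], hidden_w, hidden_b, out_w, out_b)
print(y, squared_error(y, [1.0]))
```

Training would adjust the weights to reduce this error over the whole learning set, for example by backpropagation with one of the gradient methods mentioned above.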


[Diagram: input values flow through the network layers to output values]


Figure 4. MLP structure


3.7. Random forest 

Random forest is an improvement built on top of the decision tree algorithm. The core idea
behind Random Forest is to generate multiple small decision trees from random subsets of the data (hence the
name “Random Forest”). Each decision tree gives a biased classifier (as it only considers a subset of
the data). The trees each capture different trends in the data, as shown in Figure 5.


[Diagram: trees grown on the dataset (768 instances), each splitting on different variables]


Figure 5. Random forest


This ensemble of trees is like a team of experts, each with a little knowledge of the overall subject
but thorough in their own area of expertise. In the case of classification, the majority vote of the trees
determines the class. In analogy with experts, it is like asking the same multiple-choice question to each
expert and taking the answer that the largest number of experts vote as correct. In the case of regression, the
average of all trees can be used as the prediction. In addition, the more decisive trees can be weighted more
highly relative to the others by testing on validation data [16].
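
The aggregation step can be sketched independently of how the trees are grown. In the sketch below, each "tree" is a hypothetical one-rule classifier standing in for a decision tree trained on a bootstrap sample; the feature names and thresholds are illustrative assumptions, not values from the study.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Random subset for one tree: draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(votes):
    """Class predicted by the most trees (ties go to the class seen first)."""
    return Counter(votes).most_common(1)[0][0]

def average(predictions):
    """Regression counterpart: mean of the individual tree predictions."""
    return sum(predictions) / len(predictions)

# Hypothetical stand-ins for trained trees, each looking at a different feature.
trees = [
    lambda row: 1 if row["glucose"] > 125 else 0,
    lambda row: 1 if row["bmi"] > 30 else 0,
    lambda row: 1 if row["age"] > 50 else 0,
]

patient = {"glucose": 140, "bmi": 33.0, "age": 35}
votes = [tree(patient) for tree in trees]
forest_class = majority_vote(votes)
print(votes, forest_class)

# Each tree would be trained on its own bootstrap sample of the row indices.
sample = bootstrap_sample(list(range(10)), random.Random(0))
print(len(sample))
```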


3.8. Gaussian naive bayes 

In Gaussian Naive Bayes, continuous values associated with each feature are assumed to be 
distributed according to a Gaussian distribution [17]. A Gaussian distribution is also called Normal 
distribution. When plotted, it gives a bell shaped curve which is symmetric about the mean of the feature 
values as shown in Figure 6. 
The likelihood of the features is assumed to be Gaussian, hence, conditional probability is given by: 


$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right) \tag{10}$$


Figure 6. Gaussian curve
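
The Gaussian likelihood of a single feature value can be computed directly once a mean and variance have been estimated for each class. The sketch below is illustrative only; the feature values are hypothetical, and the variance is estimated with the population formula (dividing by the number of values), which is an assumption.

```python
import math

def fit_feature(values):
    """Mean and variance of one feature within one class."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance

def gaussian_likelihood(x, mean, variance):
    """Value of the Gaussian density with the given mean and variance at x."""
    exponent = -((x - mean) ** 2) / (2.0 * variance)
    return math.exp(exponent) / math.sqrt(2.0 * math.pi * variance)

# Hypothetical glucose readings for one class.
mean, variance = fit_feature([100.0, 110.0, 120.0])

# The density is symmetric about the mean, as in the bell-shaped curve.
print(gaussian_likelihood(105.0, mean, variance))
print(gaussian_likelihood(115.0, mean, variance))
```

A Naive Bayes classifier multiplies these per-feature likelihoods with the class prior and picks the class with the largest product.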


4. PERFORMANCE EVALUATION CRITERIA FOR MODEL 
To analyze and compare the performance of the data mining methods presented in our study, we 
apply various statistics such as MAE, RMSE, NRMSE and Confusion Matrix computed as follows [18]-[20]. 
a. Mean absolute error (MAE) 
MAE measures the average magnitude of the errors in a set of predictions, without considering their 
direction. It’s the average over the test sample of the absolute differences between prediction and actual 
observation where all individual differences have equal weight. 


$$MAE = \frac{1}{n} \sum_{j=1}^{n} |y_j - \hat{y}_j| \tag{11}$$


b. Root mean square error (RMSE) 
RMSE is a quadratic scoring rule that also measures the average magnitude of the error. It’s the square root 
of the average of squared differences between prediction and actual observation. 


$$RMSE = \sqrt{\frac{1}{n} \sum_{j=1}^{n} (y_j - \hat{y}_j)^2} \tag{12}$$


c. Confusion matrix 
The information about actual and predicted classifications is held by the confusion matrix. It
demonstrates the accuracy of the solution to a classification problem. Table 1 shows the confusion matrix
for a two-class classifier. The entries in the confusion matrix have the following meaning in the context of
our study: $t_p$ is the number of correct predictions that an instance is positive, $f_n$ is the number of
incorrect predictions that an instance is negative, $f_p$ is the number of incorrect predictions that an
instance is positive, and $t_n$ is the number of correct predictions that an instance is negative.


Table 1. The Confusion Matrix for a two class Classifier 


                      Predicted
                      Positive    Negative
Actual    Positive    $t_p$       $f_n$
          Negative    $f_p$       $t_n$


d. Precision 
Precision is the ratio of correctly predicted positive observations to all observations predicted as positive.
The formula is,


$$P = \frac{t_p}{t_p + f_p} \tag{13}$$


e. Recall/true positive rate/sensitivity 
Recall is also known as sensitivity or true positive rate. It is the ratio of correctly predicted positive events
to all actual positive events.


$$R = \frac{t_p}{t_p + f_n} \tag{14}$$

f. Accuracy
The proportion of the total number of predictions that were correct is known as Accuracy (AC). It
shows the overall effectiveness of the classifier. It is determined using the equation:


$$AC = \frac{t_p + t_n}{t_p + t_n + f_p + f_n} \tag{15}$$

g. ROC 
A receiver operating characteristics (ROC) graph is a method for visualizing, organizing and selecting
classifiers on the basis of their performance [21], [22]. ROC graphs are two-dimensional graphs in which
the $t_p$ rate is plotted on the Y axis and the $f_p$ rate is plotted on the X axis. A ROC graph depicts the
relative trade-offs between benefits (true positives) and costs (false positives) [23].
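
The classification measures above can be computed from the four confusion-matrix counts, and the ROC curve from a set of predicted scores. The following is a small sketch with hypothetical labels and scores (1 = positive, 0 = negative), not the evaluation code used in the study; the area under the curve is obtained with the trapezoidal rule.

```python
def confusion_counts(actual, predicted):
    """Return (tp, fn, fp, tn) for binary labels coded as 1 and 0."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fn, fp, tn

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def accuracy(tp, fn, fp, tn):
    return (tp + tn) / (tp + fn + fp + tn)

def roc_points(actual, scores):
    """(fp rate, tp rate) pairs obtained by thresholding at each distinct score."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        predicted = [1 if s >= threshold else 0 for s in scores]
        tp, fn, fp, tn = confusion_counts(actual, predicted)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical ground truth, hard predictions and classifier scores.
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.7, 0.1]

tp, fn, fp, tn = confusion_counts(actual, predicted)
print(precision(tp, fp), recall(tp, fn), accuracy(tp, fn, fp, tn))
print(area_under_curve(roc_points(actual, scores)))
```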


5. EXPERIMENTAL RESULTS AND OBSERVATIONS 

In the experimental studies, the dataset was partitioned 70%-30% (538-230 instances) for training
and testing purposes. Table 2 shows that Logistic Regression, despite being the simplest classifier, performed
well, with an accuracy of 79.54% and a mean absolute error of 21.65%. Among the applied algorithms,
Logistic Regression has the highest accuracy, with an RMSE of 46.52%, while Gradient Boosting has the lowest
MAE (20.78%) and RMSE (45.58%). Table 2 gives a comparative analysis of the algorithms in terms of Mean
Absolute Error, Root Mean Square Error and Accuracy score [4]. ROC is plotted for all the algorithms; the
larger the area under the curve, the better the classifier. These measurements were taken using the Spyder
tool on the Pima Indian Diabetes data set from the UCI repository. The results may be improved by applying
larger, updated data sets from a realistic context. However, other machine learning algorithms need to be
applied to real data sets before generalizing the results.
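
A 70%-30% partition of 768 rows can be sketched as follows. This is a hypothetical illustration of the split sizes described above, using a seeded shuffle of row indices; the actual splitting routine and random seed used in the study are not stated, and library routines may round the sizes differently.

```python
import random

def train_test_split_indices(n_rows, train_fraction=0.7, seed=0):
    """Shuffle row indices and cut them into training and testing parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = round(n_rows * train_fraction)
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = train_test_split_indices(768)
print(len(train_idx), len(test_idx))
```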


Table 2. Summary of Prediction for different Algorithms 


Algorithm               MAE       RMSE      Accuracy Score

Logistic Regression     0.2165    0.4652    0.7954
KNNeighbors             0.2511    0.5011    0.7489
Linear SVM              0.3203    0.5660    0.6797
Gradient Boosting       0.2078    0.4558    0.7922
Decision tree           0.2684    0.5181    0.7316
MLP                     0.3593    0.5994    0.6407
Random Forest           0.2381    0.4880    0.7619
Gaussian Naive Bayes    0.2381    0.4880    0.76
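
When class labels are coded as 0 and 1 (an assumption; the coding is not stated here), every prediction error has absolute value 0 or 1, so the MAE equals the misclassification rate and the RMSE equals the square root of the MAE. For example, the KNNeighbors row fits this pattern: 1 - 0.2511 = 0.7489 and the square root of 0.2511 is approximately 0.5011. The sketch below demonstrates the relationship on hypothetical labels.

```python
import math

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def accuracy(actual, predicted):
    """Fraction of predictions that match the actual label."""
    return sum(1 for a, p in zip(actual, predicted) if a == p) / len(actual)

# Hypothetical 0/1 labels with two misclassifications out of eight.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

print(mae(actual, predicted), rmse(actual, predicted), accuracy(actual, predicted))
```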


Figure 7 shows the comparative analysis in terms of accuracy. 


[Chart legend: MAE, RMSE, Accuracy Score]


Figure 7. Comparative Analysis in terms of Accuracy


Table 3 compares the algorithms in terms of training score, test score and training time. Logistic Regression
and Gradient Boosting give the best test score (0.7836). The Neural Net classifier takes the longest time to
train on the dataset. Recall, precision and accuracy are calculated using the confusion matrix, and the
algorithms are compared on these measures.
Table 3. Comparison of Algorithms for training time and Score


Classifier             Train_Score    Test_Score    Training time
Naive Bayes            0.7672         0.7619        0.0041
Logistic Regression    0.7672         0.7836        0.0190
Random Forest          0.9963         0.7706        0.1146
K Nearest Neighbors    0.7896         0.7489        0.0030
Gradient Boosting      0.9330         0.7836        0.3414
Decision tree          1.0000         0.7403        0.0079
Linear SVM             1.0000         0.6797        0.1777
Neural Net             0.7523         0.7143        0.9177


Figure 8 shows the comparative analysis in terms of score and training time. 
Figure 8. Comparative Analysis in terms of Score and training time


Table 4 shows the results for PIMA on algorithms. 


Table 4. Results for PIMA on algorithms 


Classifier                      Precision    Recall    F1-Measure    ROC
Naïve Bayes                     0.7299       0.6962    0.7070        0.70
Logistic Regression             0.7622       0.7157    0.7298        0.75
Random Forest                   0.7288       0.6998    0.7096        0.70
K Nearest Neighbors             0.7110       0.6903    0.6978        0.69
Gradient Boosting Classifier    0.7736       0.7471    0.7540        0.75
Decision Tree                   0.6960       0.7061    0.70          0.71
Linear SVM                      0.3398       0.50      0.4046        0.50
Neural Net                      0.6123       0.6249    0.6128        0.62


Figure 9 shows the comparative analysis of algorithms in terms of ROC. 
Figure 9. Comparative analysis of algorithms in terms of ROC

Figure 10 shows the comparative analysis of algorithms in terms of recall, precision, accuracy, ROC.


[Chart legend: Recall, Precision, Accuracy, ROC]


Figure 10. Comparative analysis of algorithms in terms of Recall, Precision, Accuracy, ROC


6. CONCLUSION 

In this paper, we have examined the performance of eight machine learning algorithms, namely Logistic
Regression, K Nearest Neighbors (KNN), SVM, Gradient Boost, Decision tree, MLP, Random Forest and
Gaussian Naive Bayes, to predict the population most likely to develop diabetes on the Pima Indian diabetes
data. The performance is compared in terms of MAE, RMSE, ROC, Test Accuracy, Precision
and Recall obtained from the test set. The study concludes that the Logistic Regression and Gradient Boost
classifiers achieve a higher test accuracy, of 79%, than the other classifiers. For comparison, a Gaussian fuzzy
decision tree algorithm for diagnosis obtained an accuracy of 75% [24], and a diabetic diagnosis system
designed using rough sets obtained an accuracy of 76% [25]. Further, we plan to repeat our study
of classification models by applying intelligent machine learning algorithms to a large
collection of real-life data sets. The results obtained by our experimental algorithms can be further improved
by applying outlier detection before classification. This study can be used to select the best classifier for
predicting diabetes.